This report explores a tidy dataset containing 1599 observations of red wines and their corresponding scores from wine critics. The dataset also contains 11 variables describing the chemical properties of each wine. In this analysis, we explore the relationship between wine quality (as determined by critic score) and these chemical properties, with the hope that we can use a linear model to predict the wine score from the chemical properties.
First, we look at how the wines are scored by critics:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   3.000   5.000   6.000   5.636   6.000   8.000
Note that critic score is a discrete variable. We can observe this by setting binwidth < 1. Interestingly, no wine scores higher than 8 or lower than 3, and most wines score 5 or 6. The average score for red wine is merely 5.6. Next, we look at the individual chemical properties of red wine and explore how each one relates to the overall wine score.
Intuitively, the fixed acidity level should have an impact on taste. When the acidity level is too low, the wine can fall “flat”. If the acidity level is too high, the taste can be too sour. The distribution of fixed acidity is skewed to the right, with most wines having a fixed acidity level around 7 g/L.
For linear regression, a strongly skewed input variable can give a handful of extreme observations undue influence on the fit, so we prefer inputs whose distribution is roughly symmetric. If we want to use fixed acidity as an independent variable in our linear regression model, we can apply a log transform so that the distribution becomes approximately normal. The downside of the transformation is that the unit of fixed acidity is no longer intuitive.
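To illustrate the effect of a log transform on a right-skewed variable, here is a minimal sketch. It uses synthetic lognormal values as a stand-in for fixed acidity (not the wine data), and the `skewness` helper is a hypothetical function defined here for illustration only.

```r
# Synthetic, right-skewed stand-in for a chemical measurement (NOT the wine data).
set.seed(42)
acidity <- rlnorm(2000, meanlog = log(8), sdlog = 0.4)

# Sample skewness: mean of the cubed z-scores (close to 0 for a symmetric sample).
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(acidity)
skew_log <- skewness(log10(acidity))

cat("skewness before log10:", round(skew_raw, 3), "\n")
cat("skewness after  log10:", round(skew_log, 3), "\n")
```

On data like this, the skewness after the transform should be much closer to zero. On real data the improvement may be smaller, so it is worth checking the transformed histogram as well.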
To check whether fixed acidity has an impact on wine quality, we can plot the average wine quality conditional on fixed acidity level, as shown in the figure below. To reduce noise introduced by sampling error, we can bucket fixed acidity levels in increments of 0.5 g/L and 1 g/L. There does not appear to be a strong relationship between fixed acidity and wine quality. In some cases, wines with a low acidity level receive a high score, but there are also cases where wines with an extremely high acidity level receive high scores.
Note that the conditional mean plots could be misleading if the sample size is very small. For example, there are only a few wine samples with very high fixed acidity level. So the sampling error is very high for that group. For that reason, it makes sense to look at conditional mean in conjunction with the histogram, and limit the x-axis to only ranges with enough samples.
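The bucketing and sample-size check described above can be sketched as follows. The data frame is synthetic (random values, not the wine data), and the `min_n` threshold of 30 is an arbitrary assumption chosen for illustration.

```r
# Synthetic stand-in data (NOT the wine data): acidity in g/L, integer quality scores.
set.seed(1)
n <- 500
wines <- data.frame(
  fixed.acidity = round(rlnorm(n, meanlog = log(8), sdlog = 0.2), 1),
  quality       = sample(3:8, n, replace = TRUE)
)

# Bucket acidity into 1 g/L bins of the form [a, a + 1).
breaks <- seq(floor(min(wines$fixed.acidity)),
              ceiling(max(wines$fixed.acidity)) + 1, by = 1)
wines$bucket <- cut(wines$fixed.acidity, breaks = breaks, right = FALSE)

# Conditional mean of quality and sample size for each bucket.
mean_quality <- tapply(wines$quality, wines$bucket, mean)
bucket_n     <- tapply(wines$quality, wines$bucket, length)
by_bucket <- data.frame(
  bucket       = names(mean_quality),
  mean_quality = as.vector(mean_quality),
  n            = as.vector(bucket_n)
)
by_bucket <- by_bucket[!is.na(by_bucket$n), ]  # drop empty buckets

# Keep only buckets with enough samples for the mean to be trustworthy.
min_n <- 30  # hypothetical threshold
reliable <- by_bucket[by_bucket$n >= min_n, ]
print(reliable)
```

Plotting only the `reliable` rows has the same effect as limiting the x-axis to ranges with enough samples.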
We can also use a grouped boxplot to check for patterns. It shows not just the center of each group (the median) but also the spread of the distribution and any outliers. In this case, we also label the boxplots with the sample size as a reminder that there may be significant sampling error for wine samples with quality below 5 and above 7.
library(ggplot2)

# Return the label position (just below the group's minimum) and the group's sample size.
give.n <- function(x) {
  return(c(y = min(x) - 0.5 * sd(x),
           label = length(x)))
}

ggplot(data = wineQualityReds,
       aes(x = quality, y = fixed.acidity)) +
  geom_boxplot(aes(group = quality)) +
  scale_x_continuous(breaks = seq(3, 8, 1)) +
  stat_summary(fun.data = give.n, geom = "text", size = 3)
Next, we look at the distribution of volatile acidity across the red wines. With a smaller bin width, we observe a bimodal distribution, with the first mode around 0.4 g/L and a second mode around 0.6 g/L.
The volatile acidity level generally has a negative impact on the taste of red wine, since volatile acids are formed as wine spoils and becomes vinegar. However, some wine makers seek to introduce volatile acids at very low levels, in order to add to the complexity of a wine. This could be the potential reason that there is a slight increase in the number of wines with a volatile acidity level of 0.4 g/L, while the natural fermentation process is likely to produce wine with average volatile acidity level around 0.6 g/L.
The second graph, which shows average wine quality conditional on volatile acidity level, confirms our suspicion that this chemical property may have a negative impact on wine quality. Further, we computed the correlation using Pearson’s test, which shows a weak negative correlation (-0.38) after excluding the tail of the distribution where the volatile acidity level is greater than 1.2 g/L.
with(subset(wineQualityReds, volatile.acidity <= 1.2),
     cor.test(volatile.acidity, quality, method = 'pearson'))
##
## Pearson's product-moment correlation
##
## data: volatile.acidity and quality
## t = -16.634, df = 1593, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4257275 -0.3420591
## sample estimates:
## cor
## -0.3846832
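As a sanity check on how the numbers in this output relate to each other, the following sketch recomputes the t statistic and the 95 percent confidence interval from the reported correlation and degrees of freedom alone. It uses the standard formulas for Pearson's test (t from r, and a Fisher z interval); the variable names are chosen here for illustration.

```r
r  <- -0.3846832  # sample correlation reported above
df <- 1593        # degrees of freedom reported above (n - 2)

# t statistic for H0: true correlation is 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
p_val  <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval via the Fisher z transform (standard error 1 / sqrt(n - 3)).
z  <- atanh(r)
se <- 1 / sqrt(df - 1)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

cat("t  =", round(t_stat, 3), "\n")
cat("CI =", round(ci, 4), "\n")
```

The recomputed values should agree with the t statistic and interval printed above, up to rounding of the reported correlation.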
Citric acid is sometimes added to wine if the acidity level is too low and the wine tastes flat. Citric acid can also add complexity to the taste, giving the wine a fruity flavor. As shown in the histogram below, most wines have very little citric acid added.
By examining the average wine quality score conditional on citric acid levels, we observed that there may be a positive correlation between wine quality and citric acid level. There is a single wine sample with extremely high citric acid level (1.00 g/L) and low score (4), which we can treat as an outlier as confirmed by the histogram above.
There is indeed a weak positive correlation between citric acid and wine quality, at about 0.23, based on Pearson’s correlation test.
with(wineQualityReds, cor.test(citric.acid, quality, method = 'pearson'))
##
## Pearson's product-moment correlation
##
## data: citric.acid and quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
Chloride represents the amount of salt in the wine. Most wines have a chloride level between 0.05 and 0.10, but a few wine samples have exceptionally high chloride levels. It appears that wine quality may be negatively affected by chlorides. However, there are so few wine samples with high chloride levels that the sampling error is also very large.
with(subset(wineQualityReds, chlorides <= 0.15),
     cor.test(chlorides, quality))
##
## Pearson's product-moment correlation
##
## data: chlorides and quality
## t = -6.9828, df = 1530, p-value = 4.3e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2238519 -0.1267741
## sample estimates:
## cor
## -0.1757402
Sulfur dioxide is added to wine to prevent spoilage during the fermentation process. Intuitively, too little sulfur can lead to wine with higher volatile acidity and thus worse taste, but too much sulfur could negatively affect the taste and smell of the wine. Let’s look at the data and see if it agrees with our intuition. Note that the distribution of total sulfur dioxide is heavily skewed, so we would need to transform the data before using it as an independent variable in a linear model.
with(subset(wineQualityReds, total.sulfur.dioxide <= 165),
     cor.test(total.sulfur.dioxide, quality))
##
## Pearson's product-moment correlation
##
## data: total.sulfur.dioxide and quality
## t = -8.4748, df = 1595, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2540458 -0.1601596
## sample estimates:
## cor
## -0.2075807
Note that in the plots above, we have excluded 2 wine samples with exceptionally high levels of sulfur:
##      total.sulfur.dioxide quality
## 1080                  278       7
## 1082                  289       7
The distribution of density appears to be normal, regardless of the quality.
There appears to be some correlation between density and wine quality. That is, higher quality wine has slightly lower density.
Like sulfur dioxide, sulphates are also used to preserve wine. Unlike sulfur dioxide, sulphates do not have a strong odor and do not affect the smell and taste of wine as much. This could be the reason that average wine quality actually increases as the sulphates level increases, and the correlation between quality and sulphates is positive. Note that the distribution of sulphates is heavily skewed to the right. Even after a log transform, the distribution is still not normal.
ggplot(data = subset(wineQualityReds, total.sulfur.dioxide <= 165),
       aes(x = quality, y = sulphates)) +
  geom_boxplot(aes(group = quality)) +
  scale_x_continuous(breaks = seq(3, 8, 1)) +
  stat_summary(fun.data = give.n, geom = "text", size = 3)

with(subset(wineQualityReds, sulphates <= 1.5),
     cor.test(sulphates, quality))
##
## Pearson's product-moment correlation
##
## data: sulphates and quality
## t = 12.774, df = 1589, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2599248 0.3490798
## sample estimates:
## cor
## 0.3051708
The relationship between wine quality and alcohol is the strongest among all the chemical properties studied. This is shown in the second plot below, where we observe increasing wine quality as alcohol content increases.
Note that the distribution of alcohol is right skewed. In order to use alcohol as an independent variable in a linear regression model, the distribution should ideally be approximately normal. Interestingly, after a log transform (or a cube root transform), the distribution is still skewed to the right. A further look at the data shows that the number of wines drops off sharply below an alcohol content of 9% by volume. It’s possible that wines with lower alcohol content were somehow excluded from the sample.
with(wineQualityReds, cor.test(alcohol, quality))
##
## Pearson's product-moment correlation
##
## data: alcohol and quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
The input dataset contains 1599 observations of red wines and their corresponding score by wine critics. The dataset also contains 11 variables indicating the chemical properties of the wine.
The main feature of interest is the quality attribute, which we use to measure how good the wine tastes. We want to know which chemical properties affect the taste of wine, and whether it is possible to forecast the wine quality score using these chemical properties.
In the following sections, we further investigate the following chemical properties and see if they can help forecast the wine quality score:
- Fixed Acidity
- Volatile Acidity
- Citric Acid
- Chlorides
- Total Sulfur Dioxide
- Density
- Sulphates
- Alcohol
There are no new variables created for this analysis.
The fixed acidity level and alcohol level were transformed using log10() transform, in order to ensure the distribution is approximately normal.
Since the dataset is already tidy, there are no additional steps required to adjust the input data.
Using the ggpairs() function in the GGally package, we can examine the relationship between every pair of variables at a glance:
The plot above suggests the following attributes may be interdependent:
- fixed acidity vs. citric acid
- density vs. fixed acidity
- pH vs. fixed acidity
- volatile acidity vs. citric acid
- free sulfur dioxide vs. total sulfur dioxide
- alcohol vs. density

Let’s investigate them individually!
It appears that fixed acidity may be correlated with many of the other attributes. Both citric acid and density increase with fixed acidity level. And unsurprisingly, the higher the fixed acidity level, the lower the pH level. Most wines are clustered around a fixed acidity level of 6-10 g/L.
Since the total sulfur dioxide level probably includes the free sulfur dioxide level, it is not surprising that the two variables are highly positively correlated.
Since density depends on sugar and alcohol content, it is not surprising that there is a significant correlation between density and alcohol. As expected, density decreases as alcohol content increases. Note that alcohol level appears to be measured in discrete increments, or the measurements are rounded to the nearest level.
Since fixed acidity is correlated with many of the other attributes we are considering as independent variables for our linear model, we do not need to include all of these variables in the model. We will eliminate fixed acidity as a model input, because it has the weakest correlation with quality based on the analysis in the previous section. For the same reason, we include only total sulfur dioxide, and not free sulfur dioxide, as a model input.
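One way to screen candidate inputs for this kind of redundancy is to compute the pairwise correlation matrix and flag pairs above a cutoff. The sketch below uses synthetic predictors (not the wine data), with `density` deliberately constructed to track `alcohol`. The 0.7 cutoff is an arbitrary assumption, not a value used in this analysis.

```r
# Synthetic stand-in predictors (NOT the wine data).
set.seed(7)
n <- 300
alcohol <- rnorm(n, mean = 10.5, sd = 1)
preds <- data.frame(
  alcohol   = alcohol,
  density   = 1.01 - 0.0015 * alcohol + rnorm(n, 0, 0.0008),  # built to track alcohol
  sulphates = rnorm(n, mean = 0.65, sd = 0.15)                # independent of the others
)

cm <- cor(preds)
threshold <- 0.7  # hypothetical cutoff for "too correlated"

# Use only the upper triangle so each pair is reported once.
idx <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
print(flagged)
```

For each flagged pair, one would then keep whichever variable correlates more strongly with quality, which mirrors the reasoning used above for fixed acidity and free sulfur dioxide.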
It’s interesting that fixed acidity increases as citric acid increases, but volatile acidity decreases as citric acid increases. I wonder if citric acid is considered a subset of fixed acidity.
Based on the correlation coefficients, the relationship between pH and fixed acidity level is the strongest.
Based on the previous analysis, we know that quality is correlated with alcohol content, and that alcohol content also directly affects density. These observations are best summarized in a multivariate plot, as shown below:
We can see that lower quality wines are clustered in the range of lower alcohol content, and density decreases as alcohol content increases. It certainly makes sense to include alcohol as a model input for predicting wine score, and perhaps to exclude density as a model input since it is correlated with another model input (alcohol). Let’s replace density with another chemical property that we think could be a good predictor of quality: volatile acidity.
Imagine dividing the cluster of points representing wine samples into four quadrants. The quadrant on the lower right-hand side contains wine samples with higher alcohol content and lower volatile acidity level. As expected, these wine samples received the highest quality ratings. In contrast, the quadrant on the upper left-hand side contains wine samples with higher volatile acidity level and lower alcohol content, and these samples indeed received lower quality ratings.
Let’s add a third dimension and incorporate another explanatory variable: sulfur dioxide. As shown in the plot below, there are very few good quality wines when the sulfur dioxide level is above 150. When the sulfur dioxide level is between 100 and 150, most wines are of lower quality compared with wines whose sulfur dioxide level is below 100.
Similarly, we observed an increase in wine quality as total sulfur dioxide decreases and sulphates increase, as shown in the plot below:
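The comparison across sulfur dioxide levels can also be expressed as a table. The sketch below cuts a synthetic sulfur dioxide variable (not the wine data) into the three ranges mentioned above and computes the share of higher-rated wines in each. Treating a score of 7 or more as "good" is an assumption made here for illustration.

```r
# Synthetic stand-in data (NOT the wine data).
set.seed(3)
n <- 600
so2     <- round(rgamma(n, shape = 2, scale = 30))  # right-skewed, like total sulfur dioxide
quality <- sample(3:8, n, replace = TRUE)

# Three levels: [0, 100), [100, 150), [150, Inf].
so2_level <- cut(so2, breaks = c(0, 100, 150, Inf),
                 labels = c("below 100", "100 to 150", "150 and above"),
                 include.lowest = TRUE, right = FALSE)

good <- quality >= 7  # hypothetical definition of a "good" wine

# Row-wise proportions: share of good vs. other wines within each level.
counts <- table(so2_level, good)
share  <- prop.table(counts, margin = 1)
print(round(share, 3))
```

Because the quality scores here are random, the shares will be roughly equal across levels. With the real data, a drop in the `TRUE` column at higher levels would correspond to the pattern described above.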
Alcohol and density are both important features to consider when trying to predict wine quality from chemical properties. We observed that wines with higher alcohol content, and consequently lower density, are generally of better quality. The relationship continues to hold as we bring additional features such as volatile acid and citric acid into consideration. It is not obvious whether introducing additional features has strengthened the relationship between alcohol content and wine quality.
It appears that when we cut the scatterplots by level of total sulfur dioxide, there are higher proportions of wine with a higher quality rating. This is counterintuitive compared with the grouped boxplots shown earlier, where wine with a quality of 5 has the highest sulfur dioxide level.
We created a linear model using alcohol, volatile acidity, total sulfur dioxide, pH, and sulphates as the independent variables. These variables only accounted for 34.7% of the variance in quality.
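For readers who want to see the mechanics, here is a minimal sketch of fitting such a model with `lm()` and reading off R-squared. It uses synthetic data with two predictors (not the wine data and not the full set of variables listed above), so the coefficients and R-squared it prints are illustrative only.

```r
# Synthetic stand-in data (NOT the wine data): quality is a rounded, clipped linear signal.
set.seed(11)
n <- 400
alcohol          <- rnorm(n, mean = 10.5, sd = 1)
volatile.acidity <- rnorm(n, mean = 0.5, sd = 0.15)
latent  <- 5.6 + 0.35 * (alcohol - 10.5) - 1.5 * (volatile.acidity - 0.5) +
  rnorm(n, 0, 0.65)
quality <- pmin(8, pmax(3, round(latent)))  # integer scores between 3 and 8
wines   <- data.frame(quality, alcohol, volatile.acidity)

# Ordinary least squares fit.
fit <- lm(quality ~ alcohol + volatile.acidity, data = wines)

# R-squared: share of the variance in quality explained by the predictors.
r2 <- summary(fit)$r.squared
print(coef(fit))
cat("R-squared:", round(r2, 3), "\n")
```

Even in this synthetic setting, where the underlying relationship is linear by construction, rounding the outcome to integer scores and adding noise keeps R-squared well below 1.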
Through grouped boxplots, we observed that the quality of a wine is correlated with the amount of alcohol content in the wine. Therefore it makes sense to include alcohol content as a feature for forecasting wine quality. Using similar methods, we have found other features that may be relevant for predicting wine scores, such as volatile acids, total sulfur dioxide, pH, and sulphates.
Using multivariate plotting techniques, we can observe the joint impact of two or more features on wine quality. In the plot above, we observed that wine quality increases as alcohol content increases and volatile acidity decreases (indicated by the direction of the arrow).
The linear regression model has very low explanatory power (R^2 = 0.347). However, this does not mean the independent variables we selected for this model are irrelevant. As shown in the figure above, wine quality is a discrete variable, and a linear regression model is not well suited to forecasting a discrete dependent variable from a set of continuous independent variables.
One of the biggest struggles I had during this analysis is that the outcome we are trying to predict (wine quality) is a discrete variable. As a result, I could not directly plot an independent variable against the output variable in a scatterplot to visually check for a clear relationship. The correlation calculated using Pearson’s method is also very low. We can improve this analysis by using models and methods more suitable for analyzing discrete variables.
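As one possible starting point (an assumption on my part, not a method used in this report), a rank-based measure such as Spearman’s correlation treats quality as an ordered score rather than a continuous measurement. The sketch below compares it with Pearson’s correlation on synthetic data (not the wine data).

```r
# Synthetic stand-in data (NOT the wine data): integer quality driven partly by alcohol.
set.seed(5)
n <- 400
alcohol <- rnorm(n, mean = 10.5, sd = 1)
quality <- pmin(8, pmax(3, round(5.6 + 0.4 * (alcohol - 10.5) + rnorm(n, 0, 0.7))))

pearson <- cor.test(alcohol, quality, method = "pearson")

# exact = FALSE: quality has many ties, so an exact p-value is not available.
spearman <- cor.test(alcohol, quality, method = "spearman", exact = FALSE)

cat("Pearson r   :", round(unname(pearson$estimate), 3), "\n")
cat("Spearman rho:", round(unname(spearman$estimate), 3), "\n")
```

Spearman’s rho only assumes a monotonic relationship, so it is less sensitive to the spacing between score levels. A model designed for ordered categories would be a natural next step beyond this check.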